\[ \mathbb{E}\left( y | x\right) = \mu = \mathbf{X}\beta \]
independently and identically distributed
linearity
normally distributed errors
homoskedasticity
Data points are
independent: knowing the value of one observation doesn’t give you information about other observations
identically distributed: all observations are from the same underlying distribution with constant variance
❓ can you think of kinds of data that might violate this assumption?
Risks with (unaccounted for) non-iid data:
biased standard error estimates
inflated error rates
inefficient estimates (don’t take into account dependence)
invalid p-values
Essentially…you have the WRONG model.
❓ can you think of an example where this might be a bad assumption?
Linearity: As predictors (\(x_i\)s) increase, the amount our prediction increases is constant.
Additivity: The effect of predictors (\(x_i\)s) do not impact the effect of other predictors.
💡 Luckily there are some neat tricks to get around these while not technically violating them.
For basic linear regression we assume errors are distributed normally around the regression line.
no heavy tails (errors are more and more uncommon as they get larger)
no skewness (we’re not more likely to guess too high than too low)
No matter what the predicted value is, the error variance is constant.
Don’t worry, the rest of this module is about breaking these assumptions.
Least Squares
Maximum Likelihood Estimation
As the name implies, Least Squares aims to choose parameter values that *minimize* the sum of squared errors.
\[ \text{SSE} = \sum_{i = 1}^n(y_i - \beta_0 - \beta_1*x_i)^2 \]
SSE is one choice of a loss function.
Loss Function: a function that quantifies how “correct” the output of the model is. Lower values mean better model performance.
❓ why use a squared error rather than an absolute error?
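To make the loss concrete, here is a minimal sketch in Python (the data values below are made up for illustration):

```python
import numpy as np

# hypothetical toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def sse(beta0, beta1, x, y):
    """Sum of squared errors for the line beta0 + beta1 * x."""
    residuals = y - beta0 - beta1 * x
    return np.sum(residuals ** 2)

# a line near the true trend has a much lower loss than a flat line at 0
print(sse(0.0, 2.0, x, y))  # small
print(sse(0.0, 0.0, x, y))  # large
```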
Calculate the partial derivatives of the loss function w.r.t. each variable we can change (\(\beta_0\) and \(\beta_1\) here).
\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1)\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i)\]
and set them equal to 0.
\[\frac{\partial SSE}{\partial \beta_0} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-1) = 0\] \[\frac{\partial SSE}{\partial \beta_1} = \sum_{i = 1}^n 2(y_i - \beta_0 - \beta_1*x_i)(-x_i) = 0\]
❓ why do we set the derivatives equal to 0?
then we solve for \(\beta_0\) and \(\beta_1\) and we get:
\[\beta_0 = \bar{y} - \hat{\beta_1}* \bar{x}\]
and
\[ \beta_1 = \frac{Cov(x,y)}{Var(x)} = Corr(x,y) * \frac{sd(y)}{sd(x)} \]
These values for \(\beta_0\) and \(\beta_1\) are the ones that minimize our Sum of Squared Errors (\(\text{SSE}\)): under this loss function, they give us the best-fitting line.
💡 notice: we solved for the optimal values directly in this case; we could also solve for them iteratively (e.g., with gradient descent).
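We can check the closed-form solution numerically. A quick sketch in Python on simulated data (not the course's R code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

# closed-form least-squares estimates: beta1 = Cov(x,y)/Var(x), beta0 = ybar - beta1*xbar
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

# agrees with numpy's built-in least-squares line fit
slope, intercept = np.polyfit(x, y, deg=1)
assert np.allclose([beta0, beta1], [intercept, slope])
```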
Once we solve for the coefficients, we can use them to understand relationships between variables.
\(\beta_0\): when all the predictors are 0, what is the predicted value? (assuming 0 is a meaningful number for the predictors)
\(\beta_1\) (and others): when the predictor increases by 1 unit, how much does the predicted value change?
\(\beta_n\) is working on a continuous variable: an increase in 1 unit of \(x\) will result in a \(\beta_n\) change in the predicted value.
\(\beta_n\) is working on a categorical variable: being in the current (non baseline) category will result in a \(\beta_n\) change in the predicted value.
💡 Remember: the intercept is the predicted value *when all \(x\)s are equal to 0*.
for continuous values: this is an actual 0 or often, if z-scored, 0 represents the average/mean
for categorical values: 0 represents all indicator variables being 0, which is the reference category
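A small numerical illustration of both interpretations (simulated data, plain Python rather than the R used elsewhere in this module): with a single 0/1 indicator, the intercept is exactly the mean of the reference category, and the coefficient is the difference between group means.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
group = rng.integers(0, 2, size=n)        # 0 = reference category, 1 = other
y = 5.0 + 2.0 * group + rng.normal(size=n)

# design matrix: intercept column + indicator column
X = np.column_stack([np.ones(n), group])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# intercept = mean of the reference category; slope = difference between group means
print(beta[0], y[group == 0].mean())                          # equal
print(beta[1], y[group == 1].mean() - y[group == 0].mean())   # equal
```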
\[ p(y | x, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y- \mu)^2}{2\sigma^2}} \]
In Linear Regression, the likelihood function is the normal distribution with parameters \(\theta = \left( \mu, \sigma\right)\)
We want to choose values of \(\theta\) that maximize the likelihood of the data, \(\left(x,y\right)\)
For a single data point the value of the likelihood function, \(\mathcal{L}\left( \theta | y_i \right)\), is:
\[ \mathcal{L} \left( \theta | y_i \right) = p(y_i | x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu)^2}{2\sigma^2}} \]
where \(\mu = g^{-1}(\mathbf{X}\beta)\).
Because of our assumption that data points are independent (the first i in i.i.d), the likelihood value for all data points is simply the product of their individual likelihood values, since \(p(A,B) = p(A)*p(B) \text{ iff } A \mathrel{\unicode{x2AEB}} B\).
\[
\mathcal{L}\left(\theta | \mathbf{y} \right) = p(\mathbf{y} | \mathbf{x}, \theta) = \prod_{i=1}^n p(y_i | x_i, \theta) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}}
\]
The higher the likelihood of our data, the more evidence that a particular \(\theta\) is a good fit for the data.
\[ \text{arg max}_{\theta} \left[ \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu_i)^2}{2\sigma^2}} \right] \]
to maximize, we take derivatives and set them equal to 0, just like before.
But…
…taking the derivative of a product of many terms is hard, so we use the log likelihood instead.
\[ \ell\left(\theta | \mathbf{y} \right) = \log\left(\mathcal{L}\left(\theta | \mathbf{y} \right)\right) = \\ -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \mu_i)^2 \]
Note: \(\log()\) is a monotonically increasing function, so choosing \(\theta\) that maximizes \(\ell\left(\theta | \mathbf{y} \right)\) will also maximize \(\mathcal{L}\left(\theta | \mathbf{y} \right)\)
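To see that maximizing the (log) likelihood recovers the least-squares line, here is a sketch in Python using a generic numerical optimizer on simulated data (scipy assumed available):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1.0 + 0.5 * x + rng.normal(size=100)

def neg_log_lik(params, x, y):
    """Negative Gaussian log-likelihood for simple linear regression."""
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)            # optimize log(sigma) to keep sigma > 0
    mu = b0 + b1 * x
    n = len(y)
    return (n / 2 * np.log(2 * np.pi) + n * np.log(sigma)
            + np.sum((y - mu) ** 2) / (2 * sigma ** 2))

res = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0], args=(x, y))

# the maximum-likelihood slope and intercept match least squares
slope, intercept = np.polyfit(x, y, deg=1)
print(res.x[:2], [intercept, slope])
```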
If we choose our errors to be normally distributed, our likelihood function for a single data point is:
\[ \mathcal{L} \left( \theta | y_i \right) = p(y_i | x_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y_i- \mu)^2}{2\sigma^2}} \]
but what if we used a different distribution…?
\[ \mathbb{E}\left (y | \mathbf{X}\right ) = \mu = g^{-1}\left (\mathbf{X}\beta \right ) \\ p\left (y|x\right ) \sim \pi\left (\mu,...\right ) \]
link function: \(g()\)
likelihood function: \(\pi()\)
\[ \mathbf{X}\beta \]
“linear in the parameters”
\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\ g\left(p \right) = \log\left( \frac{p}{1-p} \right); g^{-1}\left(x \right) = \frac{1}{1+e^{-x}} \\ p\left (y | \mathbf{X} \right) \sim \text{Bernoulli}\left( g^{-1}\left(\mathbf{X}\beta \right) \right) \]
Logit link function
Bernoulli likelihood
\[ \mathcal{L}\left( \theta | \mathbf{y}\right) = \prod_{i = 1}^n p(x_i)^{y_i} \left[1- p(x_i) \right]^{1-y_i} \]
Notice, the likelihood function uses a Bernoulli distribution rather than a Normal distribution!
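A quick numerical check of the Bernoulli likelihood (Python sketch; the probabilities below are made up): confident, mostly-correct predictions score a higher log-likelihood than coin-flip predictions.

```python
import numpy as np

def bernoulli_log_lik(p, y):
    """Log-likelihood of binary outcomes y given predicted probabilities p."""
    eps = 1e-12                           # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
good = np.array([0.9, 0.1, 0.8, 0.95])    # confident and mostly right
coin = np.array([0.5, 0.5, 0.5, 0.5])     # uninformative
print(bernoulli_log_lik(good, y) > bernoulli_log_lik(coin, y))  # True
```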
what range of values can \(\mu\) take on?
linear regression: \(\mu\) can be any value
logistic regression: \(p\) must be in \([0,1]\)
Link Functions change the interpretation of coefficients in a GLM
logit: \(g^{-1}(x) = \frac{1}{1 + e^{-x}}\), \(g(x) = log \left( \frac{x}{1-x}\right)\)
an increase in 1 unit of \(\mathbf{X}_i\) is associated with a \(\beta_i\) increase in the log odds of \(y\)
an increase in 1 unit of \(\mathbf{X}_i\) is associated with the odds of \(y\) being multiplied by \(e^{\beta_i}\)
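The two statements are the same fact on different scales; a quick check in Python with a hypothetical coefficient and baseline:

```python
import numpy as np

beta = 0.7                         # hypothetical logistic-regression coefficient

def expit(z):
    """Inverse logit: maps log odds to probability."""
    return 1.0 / (1.0 + np.exp(-z))

log_odds = 0.2                     # hypothetical baseline log odds
p_before = expit(log_odds)
p_after = expit(log_odds + beta)   # after a 1-unit increase in the predictor

odds_before = p_before / (1 - p_before)
odds_after = p_after / (1 - p_after)
print(odds_after / odds_before, np.exp(beta))  # identical: the odds ratio is e^beta
```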
how are errors distributed? what are the actual values of \(y\)?
linear regression: values are continuous, distributed normally
logistic regression: values are binary, distributed bernoulli
“fat tailed distributions, you make the rockin’ world go round”
❓ if you use a student t distribution as your likelihood function, how does the likelihood of extreme values change? What impact would that have on your regression?
link function: identity \(g(x) = x\), \(g^{-1}(x) = x\)
likelihood function: student-t with \(\nu\) degrees of freedom
\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\ g\left(\mathbf{X} \right) = \mathbf{X} ; g^{-1}\left(\mathbf{X} \right) = \mathbf{X} \\ p\left (y | \mathbf{X} \right) \sim t\left( \mathbf{X}\beta, \sigma, \nu \right) \]
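One way to see what the t likelihood buys you (a Python sketch with scipy, standard scale assumed): an extreme residual is far less surprising under a heavy-tailed t than under a normal, so outliers exert less pull on the fit.

```python
from scipy.stats import norm, t

extreme = 5.0                      # a residual 5 standard deviations out
print(norm.logpdf(extreme))        # ≈ -13.4: essentially impossible under the normal
print(t.logpdf(extreme, df=3))     # ≈ -5.5: rare but plausible under t with nu = 3
```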
SAMPLING FOR MODEL 'anon_model' (CHAINS 1–4): 2000 iterations per chain (1000 warmup + 1000 sampling); each chain finished in ~0.015 seconds total.
Family: gaussian
Links: mu = identity; sigma = identity
Formula: y ~ x
Data: o (Number of observations: 50)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.26 0.20 -0.15 0.65 1.00 4083 2973
x 0.12 0.21 -0.31 0.53 1.00 3632 2735
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 1.48 0.15 1.21 1.81 1.00 3696 2910
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
SAMPLING FOR MODEL 'anon_model' (CHAINS 1–4): 2000 iterations per chain (1000 warmup + 1000 sampling); each chain finished in ~0.04 seconds total.
Family: student
Links: mu = identity; sigma = identity; nu = identity
Formula: y ~ x
Data: o (Number of observations: 50)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept 0.10 0.14 -0.17 0.38 1.00 4000 2454
x 0.57 0.16 0.25 0.87 1.00 3638 2731
Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma 0.77 0.13 0.55 1.06 1.00 3550 2925
nu 3.57 1.63 1.64 7.48 1.00 3196 2711
Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Assume \(y \in [0,1]\), what would be a problem with using linear regression to predict \(y\)?
Note: technically, beta regression is not a GLM. But if it walks like a 🦆 and quacks like a 🦆, surely…
link function: logit \(g^{-1}(x) = \frac{1}{1 + e^{-x}}\), \(g(x) = log \left( \frac{x}{1-x}\right)\)
likelihood function: beta distribution with mean \(\mu\)
\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\ g\left(p \right) = \log\left( \frac{p}{1-p} \right); g^{-1}\left(x \right) = \frac{1}{1+e^{-x}} \\ p\left (y | \mathbf{X} \right) \sim \text{Beta}\left( g^{-1}\left(\mathbf{X}\beta \right), \kappa \right) \]
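The mean/precision parameterization maps onto the usual shape parameters as \(a = \mu\kappa\), \(b = (1-\mu)\kappa\); a quick check in Python with scipy (the \(\mu\), \(\kappa\) values are arbitrary):

```python
from scipy.stats import beta as beta_dist

# hypothetical mean and precision
mu, kappa = 0.3, 10.0
a, b = mu * kappa, (1 - mu) * kappa   # convert to standard Beta(a, b) shapes

# the resulting distribution has mean mu, as the regression requires
print(beta_dist.mean(a, b))           # 0.3
```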
Poisson Regression works with count data.
\[ P(Y = k) = \frac{\gamma^ke^{-\gamma}}{k!} \]
link function: \(g(x) = log(x)\), \(g^{-1}(x) = e^x\)
likelihood function: Poisson \(Pois(\gamma)\)
Note: for the Poisson Distribution, \(\gamma = \mu = \sigma^2\)
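The equidispersion property is easy to verify by simulation (Python sketch; the \(\gamma\) value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
gamma = 3.5
samples = rng.poisson(gamma, size=200_000)

# for a Poisson, mean and variance are both gamma
print(samples.mean(), samples.var())   # both close to 3.5
```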
link function: \(g(x) = log(x)\), \(g^{-1}(x) = e^x\)
likelihood function: negative binomial \(NB(\mu, \alpha)\)
When \(\alpha\) → 0: approaches Poisson (equidispersion)
When \(\alpha\) > 0: overdispersion (variance > mean)
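Simulating with the mean/overdispersion parameterization (converting to numpy's \((n, p)\) form via \(n = 1/\alpha\), \(p = n/(n+\mu)\); values are arbitrary) shows the extra variance:

```python
import numpy as np

rng = np.random.default_rng(5)
mu, alpha = 3.5, 0.5                    # mean and overdispersion parameter

# numpy parameterizes the negative binomial as (n, p); convert from (mu, alpha)
n = 1.0 / alpha
p = n / (n + mu)
samples = rng.negative_binomial(n, p, size=200_000)

# overdispersion: Var = mu + alpha * mu^2 > mu
print(samples.mean())                   # close to 3.5
print(samples.var())                    # close to 3.5 + 0.5 * 3.5**2 = 9.625
```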